Returning scRNA dataset of size (2000, 3000)
dict_keys(['endog', 'labels', 'labels_1hot'])
pyPLNmodels: analyze multivariate count data
Groupe de travail StatOmique
MIA Paris-Saclay, INRAE
MaIAGE, INRAE
March 10, 2025
| FTL | MALAT1 | FTH1 | TMSB4X | CD74 | SPP1 | APOE | B2M | ACTB | HLA-DRA | ... | RBM22 | FBXL15 | SNAPIN | OSCAR | NUDT22 | CASP8 | DNAJA4 | COA6 | SMG1 | PLEKHF1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | ... | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |
| mean | 62 | 119 | 36 | 65 | 34 | 10 | 13 | 55 | 33 | 22 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| std | 224 | 112 | 95 | 92 | 88 | 69 | 67 | 62 | 61 | 60 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25% | 2 | 35 | 2 | 15 | 1 | 0 | 0 | 20 | 5 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50% | 5 | 94 | 7 | 36 | 3 | 0 | 0 | 41 | 14 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 75% | 17 | 169 | 22 | 74 | 16 | 0 | 1 | 66 | 36 | 7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| max | 4129 | 1171 | 2006 | 1179 | 1010 | 1792 | 1197 | 771 | 1463 | 500 | ... | 4 | 4 | 4 | 5 | 5 | 3 | 9 | 7 | 5 | 14 |
8 rows × 3000 columns
T_cells_CD4+ 734
T_cells_CD8+ 720
Macrophages 546
Name: standard_true_celltype_v5, dtype: int64
PLN (Aitchison and Ho 1989) is a multivariate generalized linear model, where
\[ \mathbf{Y}_i \mid \mathbf{Z}_i \overset{\text{iid}}{\sim} \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}\left(\underbrace{\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}}_{{\boldsymbol\mu}_i},\ \boldsymbol\Sigma\right). \]
Typically, \(n\) in the tens of thousands of samples, \(p\) in the tens of thousands of variables, and \(d \leq 10\) covariates.
\[\mathbb V[Y_{ij}] > \mathbb E[Y_{ij}] \]
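A short NumPy simulation of a single PLN coordinate (the values of \(\mu\) and \(\sigma\) below are illustrative) shows how the latent Gaussian layer produces this overdispersion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000            # number of simulated samples (illustrative)
mu, sigma = 1.0, 1.0  # latent Gaussian mean and std (assumed values)

# PLN generative model for one coordinate:
# Z ~ N(mu, sigma^2), then Y | Z ~ Poisson(exp(Z))
Z = rng.normal(mu, sigma, size=n)
Y = rng.poisson(np.exp(Z))

# The latent layer inflates the variance well beyond the mean
print(f"mean = {Y.mean():.2f}, var = {Y.var():.2f}")
```

A plain Poisson sample would give var ≈ mean; here the variance is several times larger.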
Suppose we aim to predict the cell type based on the scMARK dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def get_classif_error(data, y):
    data_train, data_test, y_train, y_test = train_test_split(
        data, y, test_size=0.33, random_state=42)
    lda = LDA()
    lda.fit(data_train, y_train)
    y_pred = lda.predict(data_test)
    return np.mean(y_pred != y_test)

from pyPLNmodels import Pln

latent_variables = Pln(rna["endog"]).fit().latent_variables
print("Classif error: ", get_classif_error(latent_variables, rna["labels"]))

Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 3.3 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Classif error: 0.17575757575757575
Dimension Reduction: rank constraint on the covariance matrix \(\boldsymbol\Sigma\) (Chiquet, Mariadassou, and Robin 2018) \[\mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top), \quad \mathbf{C} \in \mathcal{M}_{pk} \text{ with orthogonal columns}.\]
Network inference: sparsity constraint on the inverse covariance (Chiquet, Mariadassou, and Robin 2019) \[\mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol\Sigma = \boldsymbol\Omega^{-1}), \quad \|\boldsymbol\Omega \|_1 < c.\]
Unsupervised: mixture model in the latent space (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{Z}_i \mid c_i = k \sim \mathcal{N}(\boldsymbol\mu_k, \boldsymbol\Sigma_k), \quad \text{for unknown memberships } c_i.\]
Supervised: maximize separation between groups via group-specific means (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{Z}_i \sim \mathcal{N}(\sum_k {\boldsymbol\mu}_k \mathbf{1}_{\{c_i= k\}}, \boldsymbol\Sigma), \quad \text{for known memberships } c_i.\]
Zero-inflation: add a zero-inflation latent variable (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{W}_i \sim \mathcal{B}(\sigma(\boldsymbol{\mu}^{0}_i)), \quad \mathbf Y_i \sim (1 - \mathbf W_i)\odot \mathcal P(\exp(\mathbf Z_i))\]
Dimension reduction and Zero Inflation: Add a rank constraint on \(\boldsymbol \Sigma\)
Estimate \(\theta = (\mathbf{B}, {\boldsymbol\Sigma})\) and predict the \(\mathbf{Z}_i\), while the marginal likelihood of the model is
\[\begin{equation*} p_\theta(\mathbf{Y}_i) = \int_{\mathbb{R}^p} \prod_{j=1}^p p_\theta(Y_{ij} | Z_{ij}) \, p_\theta(\mathbf{Z}_i) \mathrm{d}\mathbf{Z}_i \end{equation*}\]
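The one-dimensional version of this integral can be approximated numerically. The sketch below (illustrative values of \(\mu\) and \(\sigma^2\), independent of pyPLNmodels) uses Gauss-Hermite quadrature and checks that the approximated values behave like a probability mass function:

```python
import numpy as np
from math import lgamma
from numpy.polynomial.hermite import hermgauss

# 1-D PLN marginal: p(y) = ∫ Pois(y; e^z) N(z; mu, s2) dz, approximated by
# Gauss-Hermite quadrature after the change of variable z = mu + sqrt(2*s2)*x.
mu, s2 = 0.5, 0.64               # illustrative parameter values
nodes, weights = hermgauss(60)
log_lam = mu + np.sqrt(2 * s2) * nodes   # log Poisson rates at the nodes

def pln_pmf(y):
    # Poisson log-pmf at every node, averaged with the quadrature weights
    log_pois = y * log_lam - np.exp(log_lam) - lgamma(y + 1)
    return (weights * np.exp(log_pois)).sum() / np.sqrt(np.pi)

total = sum(pln_pmf(y) for y in range(400))
mean = sum(y * pln_pmf(y) for y in range(400))
print(total, mean)  # total ≈ 1, mean ≈ exp(mu + s2/2)
```

In dimension \(p\) with a full \(\boldsymbol\Sigma\), no such low-dimensional quadrature is available, which is why the integral is intractable.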
Intractable likelihood
EM requires evaluating (some moments of) \(p_\theta(\mathbf{Z} \,|\, \mathbf{Y})\), but there is no closed form!
\(\implies\) variational EM
Use a proxy \(q_\psi\) of \(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\), minimizing a divergence within a class \(\mathcal{Q}\) (e.g., Kullback-Leibler)
\[\begin{equation*} q_\psi(\mathbf{Z})^\star \in \arg\min_{q\in\mathcal{Q}} KL\left(q(\mathbf{Z}), p_{\theta}(\mathbf{Z} | \mathbf{Y})\right). \end{equation*}\]
and maximize the ELBO (Evidence Lower BOund)
\[\begin{equation*} J(\theta, \psi) = \mathbb{E}_{\psi} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q_\psi(\mathbf{Z})] \end{equation*}\]
Standard approach: alternating maximization of the ELBO (VE-step and M-step).
Brute-force approach:
\[(\theta^{(t+1)}, \psi^{(t+1)}) = (\theta^{(t)}, \psi^{(t)}) + \eta_t \nabla_{\theta, \psi} J(\theta^{(t)}, \psi^{(t)})\]
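A minimal sketch of this brute-force gradient ascent for a single count \(y\), with \(q = \mathcal N(m, s^2)\) so that every ELBO term has a closed form. All parameter values below are illustrative, and this is plain NumPy rather than the package's optimizer:

```python
import numpy as np
from math import lgamma

# One-coordinate PLN: prior Z ~ N(mu0, v0), likelihood Y|Z ~ Poisson(e^Z),
# variational family q(Z) = N(m, s^2). Parameter values are illustrative.
y, mu0, v0 = 7, 1.0, 1.0

def elbo(m, s):
    # E_q[log p(y|Z)] + E_q[log p(Z)] + entropy of q, all in closed form
    return (y * m - np.exp(m + s**2 / 2) - lgamma(y + 1)
            - 0.5 * np.log(2 * np.pi * v0) - ((m - mu0)**2 + s**2) / (2 * v0)
            + 0.5 * np.log(2 * np.pi * np.e * s**2))

m, s, lr = 0.0, 1.0, 0.01
for _ in range(5000):  # plain gradient ascent on (m, s)
    grad_m = y - np.exp(m + s**2 / 2) - (m - mu0) / v0
    grad_s = -s * np.exp(m + s**2 / 2) - s / v0 + 1 / s
    m, s = m + lr * grad_m, s + lr * grad_s

print(m, s, elbo(m, s))  # the ELBO is higher than at the starting point
```

At convergence the stationarity condition \(y = e^{m + s^2/2} + (m - \mu_0)/v_0\) holds, mirroring the VE-step fixed point.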
The variance is available for the regression coefficients of the PLN model (only) \(\implies\) confidence intervals and hypothesis testing.
The estimator may be biased for some models.
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 2.8 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
from pyPLNmodels import Pln
pln = Pln(rna["endog"], exog = rna["labels_1hot"], add_const = False)
_ = pln.fit()

Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 2.4 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Available when calling \(\texttt{print(model)}\):
A multivariate Pln with full covariance.
======================================================================
Loglike Dimension Nb param BIC AIC
-226568.5 100 5350 245046.7454 231918.5
======================================================================
* Useful attributes
.latent_variables .latent_positions .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
None
* Additional methods for Pln are:
.get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()
The \(\texttt{PlnNetwork}\) model infers which genes are directly (conditionally) dependent, by imposing a sparsity constraint on the precision matrix:
\[\| \boldsymbol\Sigma^{-1} \|_1 \leq \text{penalty}\]
A zero entry of the precision matrix \(\boldsymbol\Omega = \boldsymbol\Sigma^{-1}\) means the two latent coordinates are independent given all the others; the associated partial correlation is \(-\Omega_{ij}/\sqrt{\Omega_{ii}\Omega_{jj}}\).
The ELBO remains differentiable.
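As a toy illustration (precision-matrix values assumed, plain NumPy): a zero entry of the precision matrix gives a zero partial correlation even though the marginal correlation is nonzero, which is exactly the conditional-independence structure a network model recovers:

```python
import numpy as np

# Toy precision matrix Omega = Sigma^{-1} with Omega[0, 2] = 0:
# in a Gaussian, a zero entry means Z_0 and Z_2 are independent
# GIVEN the remaining coordinate Z_1. Values are illustrative.
omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.8],
                  [ 0.0, -0.8,  2.0]])
sigma = np.linalg.inv(omega)

# Marginal correlation between Z_0 and Z_2 is NOT zero...
marg_corr = sigma[0, 2] / np.sqrt(sigma[0, 0] * sigma[2, 2])

# ...but the partial correlation, read off the precision matrix, is:
part_corr = -omega[0, 2] / np.sqrt(omega[0, 0] * omega[2, 2])
print(marg_corr, part_corr)
```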
Setting the offsets as the log of the sum of `endog`.
Fitting a PlnNetwork model with penalty 2000
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 2.9 seconds.
Last criterion = 5.34e-06 . Required tolerance = 1e-06
The \(\texttt{PlnLDA}\) model predicts the labels of newly observed data, by performing LDA in the latent space.
from pyPLNmodels import PlnLDA, plot_confusion_matrix
ntrain = 500
endog_train, endog_test = rna["endog"][:ntrain], rna["endog"][ntrain:]
labels_train, labels_test = rna["labels"][:ntrain], rna["labels"][ntrain:]
lda = PlnLDA(endog_train, clusters = labels_train).fit()
pred_test = lda.predict_clusters(endog_test)

Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 1.9 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Doing a VE step.
Done!
Setting the offsets to zero.
Doing a VE step.
Done!
Setting the offsets to zero.
Doing a VE step.
Done!
import seaborn as sns
from pyPLNmodels import ZIPln
zi = ZIPln(rna["endog"], exog = rna["labels_1hot"], add_const = False).fit()

Setting the offsets to zero.
Fitting a ZIPln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 6.5 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Use two latent vectors \(\mathbf{W}_i\) and \(\mathbf{Z}_i\) to model excess of zeroes and dependence structure \[\begin{array}{rrl} \text{PLN latent space} & \boldsymbol{Z}_i = (Z_{ij})_{j=1\dots p} & \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{B}, \mathbf{\Sigma}), \\[1.5ex] \text{excess of zeros} & \boldsymbol{W}_{i} = (W_{ij})_{j=1\dots p} & \sim \otimes_{j=1}^p \cal B(\pi_{ij}), \\[1.5ex] \text{observation space} & Y_{ij} \, | \, W_{ij}, Z_{ij} & \sim (1-W_{ij})\mathcal{P} \left(\exp\{o_{ij} + Z_{ij}\}\right),\\ \end{array}\]
\(\rightsquigarrow\) Better handling of zeros + additional interpretable parameters.
Still overdispersion \[\mathbb V[Y_{ij}] > \mathbb E[Y_{ij}]\]
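A quick NumPy simulation (parameter values illustrative, independent of the package) shows both the excess of zeros added by the Bernoulli layer and the overdispersion that remains:

```python
import numpy as np

rng = np.random.default_rng(1)
n, pi = 100_000, 0.3  # sample size and zero-inflation probability (assumed)

# Shared PLN layer: Z ~ N(1, 1), Y_pln | Z ~ Poisson(e^Z)
Z = rng.normal(1.0, 1.0, size=n)
y_pln = rng.poisson(np.exp(Z))

# Zero-inflated layer: W ~ Bernoulli(pi) wipes out a fraction of the counts
W = rng.binomial(1, pi, size=n)
y_zi = (1 - W) * y_pln

print("zero fraction, PLN  :", (y_pln == 0).mean())
print("zero fraction, ZIPln:", (y_zi == 0).mean())
print("ZIPln var/mean      :", y_zi.var() / y_zi.mean())  # still > 1
```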
\[ \pi_{ij} = \sigma( \boldsymbol X^{0}\boldsymbol B^0)_{ij}, ~ \boldsymbol X^0 \in \mathbb R^{n \times d_0}, ~ \boldsymbol B^0 \in \mathbb R^{d_0\times p}\]
Identifiability of the model is ensured if the covariate matrix \(\boldsymbol X^0\) has full rank.
\[W_{ij} | Y_{ij} \sim \mathcal{B}\left(\frac{\pi_{ij}}{ \varphi\left(\mathbf{x}_i^\top \boldsymbol B_j, \Sigma_{jj}\right) \left(1 - \pi_{ij}\right) + \pi_{ij}}\right) \boldsymbol 1_{\{Y_{ij} = 0\}}\]
with \(\varphi(\mu,\sigma^2) = \mathbb E \left[ \exp(-X)\right], ~ X \sim \mathcal L \mathcal N \left( \mu, \sigma^2\right)\).
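\(\varphi\) has no closed form, but it is a smooth one-dimensional integral. The sketch below (illustrative parameter values, not pyPLNmodels code) approximates it by Gauss-Hermite quadrature, cross-checks with Monte Carlo, and plugs it into the posterior probability above:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# phi(mu, s2) = E[exp(-X)] with X log-normal, i.e. E[exp(-e^G)], G ~ N(mu, s2).
# Parameter values are illustrative.
mu, s2 = 0.0, 0.25

# Gauss-Hermite quadrature after the change of variable g = mu + sqrt(2*s2)*x
nodes, weights = hermgauss(40)
phi_quad = (weights * np.exp(-np.exp(mu + np.sqrt(2 * s2) * nodes))).sum() / np.sqrt(np.pi)

# Monte Carlo cross-check of the same expectation
rng = np.random.default_rng(0)
phi_mc = np.exp(-np.exp(rng.normal(mu, np.sqrt(s2), 200_000))).mean()

# Posterior probability that an observed zero is a structural zero
pi = 0.4  # illustrative inflation probability
post = pi / (phi_quad * (1 - pi) + pi)
print(phi_quad, phi_mc, post)
```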
zi = ZIPln.from_formula("endog ~ site + time", micro).fit()
pln = Pln.from_formula("endog ~ site + time", micro).fit()

`exog_inflation` and `exog` are set to the same array. If you need different `exog_inflation`, specify it with a pipe: '|' like in the following: endog ~ 1 + x | x + y
Setting the offsets to zero.
Now dataset of size torch.Size([300, 492]).
Fitting a ZIPln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 15.2 seconds.
Last criterion = 5.49e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Now dataset of size torch.Size([300, 492]).
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 7.5 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
\[\boldsymbol \Sigma_{\epsilon} = \boldsymbol \Sigma - \Phi\boldsymbol \Sigma \Phi^\top \succ 0\]
“Easy” cases:
Non positive definite case:
More complex case:
\[ \begin{array}{c|cc} \Phi \ \backslash \ \Sigma & \text{Diagonal} & \text{Full} \\ \hline \text{Scalar} & \checkmark & \checkmark \\ \text{Diagonal} & \checkmark & \times \\ \text{Full} & \times & ?? \\ \end{array} \]
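A two-dimensional NumPy check (covariance values chosen for illustration) verifies one ✓ and one × from this table: a scalar \(\Phi = \phi I\) gives \(\boldsymbol\Sigma_\epsilon = (1-\phi^2)\boldsymbol\Sigma \succ 0\), while a diagonal \(\Phi\) combined with a full \(\boldsymbol\Sigma\) can fail to be positive definite:

```python
import numpy as np

# Sigma_eps = Sigma - Phi @ Sigma @ Phi.T, as in the stability condition above.
sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])      # full covariance (illustrative values)

# Scalar Phi: Sigma_eps = (1 - phi^2) * Sigma, positive definite for |phi| < 1
phi_scalar = 0.9 * np.eye(2)
eps_scalar = sigma - phi_scalar @ sigma @ phi_scalar.T
print(np.linalg.eigvalsh(eps_scalar))  # both eigenvalues positive

# Diagonal Phi with strongly correlated Sigma: positive definiteness breaks
phi_diag = np.diag([0.99, 0.0])
eps_diag = sigma - phi_diag @ sigma @ phi_diag.T
print(np.linalg.eigvalsh(eps_diag))    # one eigenvalue is negative
```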
Data: number of exchanges of genetic material (crossovers) between two DNA strands at each locus.
All 27 chromosomes placed end to end \(\implies\) 500 loci, for \(\textbf{4 different species of sheep}\).
\(\implies\) 4 time series of 500 loci, where each locus depends on the previous one.
from pyPLNmodels import load_crossover, PlnAR
cross = load_crossover()
ar = PlnAR(cross["endog"], ar_type = "diagonal").fit()

Returning crossover dataset of size (500, 4)
Setting the offsets to zero.
Fitting a PlnAR model with autoregressive type diagonal.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 1.4 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06